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Abstract 

We observe an undirected graph G without multiple edges and self-loops, which is to 
represent a protein-protein interaction (PPI) network. We assume that G evolved un¬ 
der the duplication-mutation with complementarity (DMC) model from a seed graph, 

Go, and we also observe the binary forest P that represents the duplication history of 
G. A posterior density for the DMC model parameters is established, and we outline a 
sampling strategy by which one can perform Bayesian inference; that sampling strat¬ 
egy employs a particle marginal Metropolis-Hastings (PMMH) algorithm. We test our 
methodology on numerical examples to demonstrate a high accuracy and precision in 
the inference of the DMC model’s mutation and homodimerization parameters. 
Keywords: Protein-protein interaction (PPI) network, duplication-mutation with 
complementarity (DMC) model, particle marginal Metropolis-Hastings (PMMH), se¬ 
quential Monte Carlo (SMC). 


1 Introduction 

As a result of breakthroughs in biotechnology and high-throughput experiments thousands 
of regulatory and protein-protein interactions have been revealed and genome-wide protein- 
protein interaction (PPI) data are now available. Protein-protein interactions are one of 
the most important components of biological networks as they are fundamental to the func¬ 
tioning of cells. To gain a better understanding of why these interactions take place, it is 
necessary to view them from an evolutionary perspective. The evolutionary history of PPI 
networks can help answer many questions about how present-day networks have evolved and 
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provide valuable insight into molecular mechanisms of network growth mu- However, in¬ 
ferring network evolution history is a statistical and computational challenging problem as 
PPI networks of extant organisms provide only snapshots in time of the network evolution. 
There has been recent work on reconstructing ancestral interactions (e.g. [5l[7l[TT]). The 
main growth mechanism of PPI network is gene duplication and divergence (mutations) 
m all proteins in a family evolve from a common ancestor through gene duplications and 
mutations and the protein network reflects the entire history of the genome evolution |14) . 
In this paper we follow uni and we develop computational methods to infer the growth 
history and the parameters under the given model incorporating not only the topology of 
observed networks, but also the duplication history of the proteins contained in the net¬ 
works. In their article |10j propose a maximum likelihood approach. The authors establish 
a neat representation of the likelihood function and it is this representation which is used 
in this article. The duplication history of the proteins can be inferred independently by 
phylogenetic analysis mm- 

The approach we adopt here is first to obtain a numerically stable estimate of the likeli¬ 
hood function, under fixed parameters; this achieved via the sequential Monte Carlo (SMC) 
method (see [4] and 0). This approach can then be used to infer the parameters of the 
model, from a Bayesian perspective, as well as the growth history, via a Markov chain Monte 
Carlo (MCMC) method. To the best of our knowledge, this has not been considered in the 
literature, although related ideas have appeared for simpler models in m- Our computa¬ 
tional strategy not only improves on likelihood estimation in comparison to [10) . but also 
provides a natural setup to perform posterior inference on the parameters of interest. 

This article is structured as follows. In Section[2]we detail the model and associated com¬ 
putational method for statistical inference. In Section|^our numerical results are presented. 
In Section |4] the article is concluded with some discussion of future work. 

2 Model and Methods 

We follow similar notation and exposition as in [10] to introduce the protein-protein inter¬ 
action network, its duplication history, and the duplication-mutation with complementarity 
(DMC) model [13]. In particular, the notions of adjacency and duplication are made con¬ 
crete there. We also introduce the associated Bayesian inference problem with which this 
work is primarily concerned (i.e., that of inferring the parameters of the DMC model). We 
then describe a particle marginal Metropolis-Hastings (PMMH) algorithm [T] that can be 
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used to perform such inference. 


2.1 PPI network and DMC model 


Consider an undirected graph G without multiple edges and self-loops, where the nodes 
represent proteins and the edges represent interactions between those proteins. Such a 
graph is called a PPI network, and as in m, we denote the vertex set by V{G), the edge 
set by E{G), and the number of nodes in G by |F(G')|. All nodes which are adjacent to a 
node V (not including v itself) comprise the neighbourhood of v, and that neighbourhood is 
denoted by Nc{v). 

We assume that G evolved from a seed graph Go via a series of duplication, mutation, 
and homodimerization steps under a DMC model. Under the DMC model, at each time 
step t, the graph Gt evolves from Gt-i by the following processes in order: 

1. The anchor node Ut is chosen uniformly at random from U(Gt_i), and a duplicate 

node Vt is added to Gt-i and connected to every member of This is the 

duplication step, and it yields an intermediary graph denoted Gl_i. 

2. For each w € {ut), we uniformly choose one of the two edges in {{ut, w), {vt, w)} C 

E{G^_i) at random and delete it with probability (1 — p). This is the mutation step, 
and the parameter p is henceforth referred to as the mutation parameter. 

3. The anchor node Ut and the duplicate node Vt are connected with probability Pc 
to finally obtain Gt- This is the homodimerization step, and the parameter pc is 
henceforth known as the homodimerization parameter. 


The DMC model is Markovian, and we denote the transition density at time t (which 
encompasses the three afore-mentioned steps) by p>i(G( | Gt-i), where M := {p,Pc)- If we 
assign to a seed graph some prior density pm{Go), then the density of the observed graph 
G will be 


Pm{G) 


E 


PAr(Go) ]^PAr(Gt I Gt_i) , 

t=i 


( 1 ) 


where G = G„, n = |U(G)| — |U(Go)| and H = (Gq, Gi, ..., G„ = G) denotes the collection 
of growth histories. In this work, a seed graph will always be the graph consisting of two 
connected nodes; thus, |U(Go)| = 2 and pm(Go) = 1. Note that we are summing over all 
possible growth histories by which G can evolve from a seed graph. Also note that a growth 
history H induces a unique sequence of duplicate nodes, 0{'H) = {vi,..., Vn) [TP] . 
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2.2 Bayesian inference 

In practice, one will not have access to the parameters {p,Pc) and they must be inferred 
given G. Thus, in the Bayesian setting, our objective is to consider the posterior density 

tt{M\G)o,p{M)pm{G), (2) 

where p{M) is some proper prior for (j),Pc) that we assume can easily be computed (at least 
pointwise up to a normalising constant). 

The total number of growth histories grows exponentially with n [lOj , and so any compu¬ 
tations involving 0 , and thus 0, (e.g. the evaluation ofpx(G)) could potentially become 
very expensive. In the following sections, we reformulate the inference problem in the same 
manner as in [10] to alleviate this issue. 

2.3 Duplication history 

As in uni, let r be a binary forest, i.e. a collection of rooted binary trees. The authors of 
m describe a scheme that encodes the duplication history of a growth history H within a 
series of duplication forests, (TojTi,... ,r„), where each forest Tj corresponds to a graph 
Gt- We describe that scheme here. 

Consider a trivial forest Tq, whose only two isolated trees each consist of a single node. 
Each of those isolated nodes will correspond to a node within the seed graph Go- To build 
El from To, one replaces an anchor node ui from Tq with a subtree, consisting of 

two leaves (wi is the duplicate node included Gi but not Gq). This process continues until 
one builds the series of forests (ro,ri,... ,r„ = T) to correspond to T-L. 

As highlighted in [10], the duplication forest T (corresponding to G) is uniquely deter¬ 
mined by H and a list of anchor nodes, tt = (rti ,... ,Un)- The important thing to emphasize 
here, is that given the duplication forest T and G, one now only needs to infer the duplica¬ 
tion nodes sequence OiH) = {vi,V 2 , ■ ■ ■ ,Vn) to reconstruct the complete growth history H. 
For instance, at the first step backwards, knowledge of r„ = T together with uniquely 
identifies the anchor node thus one can reconstruct Gn-i and r„_i; this is then re¬ 
peated for the remaining backward steps. Thus, given one can construct the growth 

history T-L backward-in-time using the backward operators defined in [TO] Section 2.4], which 
construct Gt_i,rt_i deterministically given {Gt,Tt,Vt)., for t = n,n — 1,..., 1. 
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Figure 1: An example growth history for a network together with the corresponding history 
of the duplication forest. In this example, (ui,M 2 ,M 3 ) = (2,1,3) and (vi,V 2 ,V 3 ) = (3,4,5). 

2.4 Bayesian inference given the duplication history 

Now suppose that in addition to G, a practitioner is given F corresponding to G. Our new 
objective - and the primary inference problem with which this work is concerned - is to 
consider the posterior density 7r(A4|G,r). Notice that we have the joint distribution: 

n 

n{M,{G,T},e)=p{M)pM{Glr^,)l[pMiGt,rt I 

t=l 

where 0 = (ui,..., u„) is a sequence of duplication nodes compatible with the observed G, F, 
and Gq, Fq, ..., G®, F® the corresponding reconstructed history. We are thus interested in 
the parameter posterior: 

TT{M\G,T)o,p{M)pM{G,r), (3) 

n 

Pm{G,T)= Y, P>t(G^,F^)[]p^(G^r®|Gti,rti) , 

0\G,r t=i 

The density PAi(Go,ro) is typically a trivial term which could be ignored in practice. As 
the duplication forest F limits the number of allowable anchor-and-duplicate node pairs, one 
can see that the number of possible growth histories is reduced. 


2.5 Methods 

We will now present an SMC algorithm that can sample the latent growth histories from 
the DMC model given the fixed parameters A4 := {p,pc)- We then show that this algorithm 
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can be employed within a PMMH algorithm, as in [1] , to sample from the posterior ^ and 
infer M. (and even F). 

An SMC algorithm simulates a collection of N samples (or, particles) sequentially along 
the index t via importance sampling and resampling techniques to approximate a sequence 
of probability distributions of increasing state-space, which are known pointwise up to their 
normalising constants. In this work, we use the SMC methodology to sample from the 
posterior distribution of the latent duplication history: 


pm{0 I G,r) ^pMiGlr^o)l[pM{Gt,T^t I GU,rU) 


backwards along the index t via Algorithm in the Appendix. The technique provides an 
unbiased estimate of the normalising constant [31 Theorem 7.4.2], p^(G,r): 


p*,(G,r) = n[lx:iv; 

i=l 


(4) 


where each Wl is an un-normalised importance weight computed in Algorithm Note that 
under assumptions on the model, if A^ > cn for some c < oo, then the relative variance 
of the estimate is 0{n/N); see |3]. It is remarked that, as in m, one could also use the 
discrete particle filter |5], with a possible improvement over the SMC method detailed in 
Algorithm]^ see m for some details. 

This SMC can be employed within a PMMH algorithm to target the posterior of Ai in 
(§. One can think of the deduced method as an MCMC algorithm running on the marginal 
Al-space, but with the SMC unbiased estimate pm{G,T) replacing the unknown likelihood 
Pm{G,T). More analytically, we can consider all random variables involved in the method 
and write down the equilibrium distribution in the enlarged state space, with Al-marginal 
the target posterior (Gj h). Following [T] and letting (/>] denote a sample (Gt,rt) at time 
t, the extended equilibrium distribution is written as: 


„l:Ar 


,1:W 

'0:n-l 


'{l.M. 

^(M,4„_i|G,r) 

iV" 


G,r) = 


(5) 


where is the probability of all the variables associated to Algorithm^ with € 
{!,..., A^} and the (jAs being the simulated variables at each step of Algorithm 

A PMMH algorithm (see Algorithm]^ samples from ([^, and one can remove the auxil¬ 
iary variables from the samples to obtain draws for the parameters from (§. Furthermore, 
one could even save the sampled growth histories with particle index I to obtain draws from 
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the joint posterior Tr(M, 9 \ G, F). However, in this work, we are primarily interested in the 
inference of M.. 


3 Results 

The variance of the estimate Q plays a crucial role in the performance of Algorithm as 
@ is used to compute the acceptance probability within the PMMH algorithm. Thus, we 
first tested the variability of Q as computed by Algorithm[^to understand how the variance 
changes with |H(G)|. We then ran Algorithm to sample from the posterior ([^ and infer 
M. for a given pair of observations (G,r). We present the details of those experiments 
below. 

3.1 Variance of r) 

We simulated a graph G and a forest T from the DMC model m with the parameters set 
as (p = 0.7,Pc = 0.7), where |H(G)| = 40. We saved each pair {Gt,Tt) for 1 < t < 40, and 
we ran Algorithm fifty times per pair (with N = |H(Gt)| * 5) to compute fifty unbiased 
estimates of PMiGt,7't) for 1 < t < 40. In the top of Figurein Section]^ we plot the 
relative variance of the estimate (or, the variance divided by the square of the expected value) 
per each value of |H(Gt)|. We repeated the experiment two more times, with V = |H(Gt)|*10 
and N = \ V{Gt) \ * 20, and the associated output is also presented in Figure 

As remarked above, if V > cn for some c < oo, then the relative variance of pM{Gt, Ft) 
is 0{n/N). Figureconfirms that the variance increases linearly and that increasing the 
value of N with |H(G)| (at least linearly) will help to control the variance. However, the 
plots show that the relative variance is still high, which means that N will have to be large 
to ensure satisfactory performance of the PMMH in practice. 

3.2 Parameter inference 

We separately simulated a graph G and a forest F from the DMC model with the parameters 
again set as {p = 0.7,Pc = 0.7), and we set |H(G)| = 15. Given only (G,F), we inferred 
(p,Pc) with each parameter having a uniform prior on the interval [0.1,0.9]. We set the 
number of particles within Algorithm to he N = 2000, and we ran the PMMH algorithm 
to obtain 10,000 samples from the extended target. 

Figure in Section illustrates good mixing of the PMMH algorithm and accurate 
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inference of the parameters {p,pc). The trace plots show that the algorithm is not sticky, 
and the autocorrelation functions give evidence to an approximate independence between 
samples. The posterior densities are also interesting, in that they are clearly different from 
the uniform priors and they show that the PMMH algorithm spends a majority of the 
computational time sampling the true parameter values. 

4 Discussion 

We have introduced a Bayesian inferential framework for the DMC model [14], where, as 
in m, one assumes the pair (G, T) is observed and the parameters {p,Pc) are unknown. 
We then described how an SMC algorithm can be used to simulate growth histories and 
ultimately be employed within PMMH to target the posterior distribution of the parameters 
([^, thereby opening up the possibility of performing Bayesian inference on the DMC model. 

Numerical tests demonstrated that Algorithmcan have a high variability when |H(G)| 
is large and N is not sufficiently high, and this limits the scope of the inference problem 
which can be tackled using the complete Algorithmic However, the proposals used in the 
example experiments within Algorithms [C and [C are naive, as the method simply chooses a 
candidate duplicate node vl at random from all permitted nodes given the current GJ, TJ. It 
is reasonable to assume that more sophisticated proposal densities could reduce the variance 
of the SMC and/or improve the mixing of the PMMH, thereby allowing one to perform 
inference when |H(Gt)| is large and N is smaller. This could be explored in a future work. 
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A Figures 


Variability of 





Figure 2: All plots illustrate the relative variance of Pm per \V{Gt)\ on the horizon¬ 

tal axis; the relative variance is the variance divided by the square of the expected value. In 
the top plot, the number of SMC particles used to compute each pM{Gt,Tt) is |V(G't)| * 5. 
In the middle and bottom plots, that number is |V(Gt)| * 10 and |V(Gt)| * 20, respectively. 
Recall that the seed graph, Gg, has two nodes, and note that we did not compute Pm (Gq, Fq) 
because it is trivial. 
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Parameter inference 






Figure 3: Plots associated with p (pc) are at left (right). The top figures are trace plots, 
with PMMH iteration running along the horizontal axes and parameter value running along 
the verticals. The middle figures are plots of the autocorrelation functions (with lag running 
along the horizontal axes), and at the bottom we present the parameter posterior densities. 
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B Algorithm Summaries 


Algorithm 1 Sequential Monte Carlo (SMC) 

• Step 0: Input an observed graph G = Gn and a corresponding observed forest T = r„, 

where G is not a seed graph. 

• Step 1: Set t = n. For i S {1,..., A}, sample a subtree with two nodes uniformly at 
random from F), and chose one of the two nodes uniformly as the proposed duplicate 
node vl (thus the other will be the anchor node). Using the backward operators defined 
in nni Section 2.4], construct each {G\_i,V\_^) from the subtrees and {G\,T\). For 
i G {1,..., N}, compute the un-normalised weight 

PM{Gi,ri\ Gi_„ri_,) 

-> 

Qm{vI) 

where q _\4 is the density of the proposal mechanism used to sample {u)}. 

• Step 2: If {G(A} ^.re not seed graphs, then set t = t — 1 and continue to Step 3. 
Otherwise, the algorithm terminates. 

• Step 3: For i S {!,..., A}, sample a) G {!,..., A} from a discrete distribution on 

{!,..., A} with probability wj oc IF/. The sample are the indices of the 

resampled particles. Set all normalised weights equal to A“^. 


• Step 4: For i G A}, sample a subtree with two nodes uniformly at random from 

the resampled forest F^*, and select uniformly one of the two nodes as the proposed 
duplicate node v). Construct (G)_^,r)_;^) from vl, (G“*,r“‘). For i G {!,...,A}, 
compute the un-normalised weight 

— ( i\ 


Return to Step 2. 
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Algorithm 2 Particle Marginal Metropolis-Hastings (PMMH) 

• Step 0: Set r = 0. Sample ~ p{-). All remaining random variables can be 

sampled from their full conditionals defined by the target ([^: 

- Sample ~ ^' 7 w(’-)(-) via Algorithmj^ 

- Choose a particle index cx 
Finally, calculate p^(r)(G,P) via Q. 

• Step 1: Set r = r + 1. Sample A4* ~ q{-\M.). All remaining random variables can be 
sampled from their full conditionals defined by the target ([^; 

- Sample (t>*Q^n-n via Algorithmj^ 

- Choose a particle index 1 * oc . 

Finally, calculate P 7 V(»(G, F) via Q. 


Step 2: With acceptance probability 


1 A 


PM>{G,V)q{M\M*) 

PM^r-^,{G,T)q{M^\My 


set Otherwise, set 

((<•->, M‘--\ «S'bf) = ((<•-■>, xU’-i, «S'Ab^"), 


Return to the beginning of Step 1. 
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